In [ ]:
 
TITLE: A Data-Driven Approach to Urban Mobility Enhancement
Authored by: Akintomiwa Aremu James
Duration: {90} mins
Level: {Intermediate}
Pre-requisite Skills: Python (pandas, matplotlib, geopandas), data analysis, machine learning and basic understanding of urban planning concept
Scenario

{As an urban planner tasked with improving pedestrian infrastructure and public transportation services, I need to understand the relationship between pedestrian activity peaks and the locations of bus stops within a given area. By analyzing these patterns, I aim to provide recommendations for optimizing bus stop locations, enhancing pedestrian facilities, and reducing congestion to promote a healthier urban environment..}

What this use case will teach you

At the end of this use case you will:

  • Data manipulation and analysis using pandas
  • Advanced visualization techniques with matplotlib and geopandas for geospatial data analysis.
  • Spatial analysis using geopandas to understand geographic relationships between pedestrian activity and bus stop locations.
  • Implementation and interpretation of clustering algorithms for identifying spatial patterns in data.
  • Integration of machine learning techniques (such as clustering) with spatial analysis to derive actionable insights for urban planning.
  • Application of data-driven decision-making processes in urban planning contexts.ntexts.
A brief Walkthrough

In urban planning, understanding the dynamics of pedestrian activity and public transportation infrastructure is crucial for creating sustainable and livable cities. This use case focuses on analyzing the correlation between pedestrian activity peaks and the locations of bus stops using Python libraries such as pandas, matplotlib, and geopandas. By examining these relationships, we aim to provide insights for urban planners to optimize infrastructure and enhance the urban environment.}

writing out the dependencies¶

In [1]:
# Dependencies
import warnings
warnings.filterwarnings("ignore")

import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import io
import seaborn as sns
import folium
from folium.plugins import HeatMap
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
pd.set_option('display.max_columns', None)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score
In [ ]:
 

calling the three datasets¶

pedestrian counts dataset¶

bus stop datasets¶

sensor reading datasets¶

In [130]:
# **Preferred Method**: Export Endpoint
from io import StringIO


def API_unlimited(datasetName):

    dataset_id = datasetName
    # https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour/information/
    #dataset_id = 'pedestrian-counting-system-monthly-counts-per-hour'
    
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    #apikey = api_key
    dataset_id = dataset_id
    format = 'csv'
    
    url = f'{base_url}{dataset_id}/exports/{format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC',
       # 'api_key': apikey
    }
    
    # GET request
    response = requests.get(url, params=params)
    
    if response.status_code == 200:
        # StringIO to read the CSV data
        url_content = response.content.decode('utf-8')
        datasetName = pd.read_csv(StringIO(url_content), delimiter=';')
        print(datasetName.sample(10, random_state=999)) # Test
        return datasetName
    else:
        print(f'Request failed with status code {response.status_code}')
In [3]:
dataset_id_1 = 'pedestrian-counting-system-monthly-counts-per-hour'
dataset_id_2 = 'sensor-readings-with-temperature-light-humidity-every-5-minutes-at-8-locations-t'
dataset_id_3 = 'bus-stops'

pedestrian_hour = API_unlimited(dataset_id_1)
       sensor_name                  timestamp  locationid  direction_1  \
299310    Col700_T  2023-06-11T00:00:00+00:00           9           72   
273077    SprFli_T  2024-01-16T19:00:00+00:00          75           30   
230538    Bou688_T  2023-08-22T21:00:00+00:00          58          794   
545967    FLDegC_T  2024-03-12T00:00:00+00:00          69          211   
41658     BouHbr_T  2023-07-06T06:00:00+00:00          10          275   
311686    BouBri_T  2023-06-03T18:00:00+00:00          57            2   
503749    WestWP_T  2024-02-09T08:00:00+00:00          71           10   
362060     Col12_T  2023-12-08T17:00:00+00:00          18          471   
90696         AG_T  2023-09-29T13:00:00+00:00          29           71   
340770      ACMI_T  2023-08-06T06:00:00+00:00          72          352   

        direction_2  total_of_directions                    location  
299310           93                  165  -37.81982992, 144.95102555  
273077           18                   48  -37.81515276, 144.97467661  
230538          127                  921  -37.81686075, 144.95358075  
545967          134                  345  -37.81687226, 144.96559144  
41658            61                  336  -37.81876474, 144.94710545  
311686            4                    6   -37.8176735, 144.95025594  
503749            4                   14  -37.81235775, 144.97136962  
362060          126                  597  -37.81344862, 144.97305353  
90696            94                  165  -37.81965809, 144.96863453  
340770          448                  800  -37.81726338, 144.96872809  
In [ ]:
 
In [5]:
# View the pedestrian counts dataset
pedestrian_hour
Out[5]:
sensor_name timestamp locationid direction_1 direction_2 total_of_directions location
0 SprFli_T 2023-04-24T21:00:00+00:00 75 36 17 53 -37.81515276, 144.97467661
1 SprFli_T 2023-04-25T00:00:00+00:00 75 28 50 78 -37.81515276, 144.97467661
2 SprFli_T 2023-04-25T01:00:00+00:00 75 63 63 126 -37.81515276, 144.97467661
3 SprFli_T 2023-04-25T02:00:00+00:00 75 85 89 174 -37.81515276, 144.97467661
4 SprFli_T 2023-04-25T08:00:00+00:00 75 365 59 424 -37.81515276, 144.97467661
... ... ... ... ... ... ... ...
549971 474Fl_T 2024-03-18T14:00:00+00:00 141 5 15 20 -37.81997273, 144.95834911
549972 Hammer1584_T 2024-03-18T02:00:00+00:00 142 349 285 634 -37.81970749, 144.96795734
549973 Hammer1584_T 2024-03-18T03:00:00+00:00 142 282 173 455 -37.81970749, 144.96795734
549974 Hammer1584_T 2024-03-18T08:00:00+00:00 142 260 261 521 -37.81970749, 144.96795734
549975 Hammer1584_T 2024-03-18T10:00:00+00:00 142 157 146 303 -37.81970749, 144.96795734

549976 rows × 7 columns

In [ ]:
 

Fetching sensor reading datasets¶

In [6]:
sensor_reading = API_unlimited(dataset_id_2)
                       timestamp               mac  boardtype  boardid  \
16436  2015-02-06T07:20:00+00:00  0013a20040b31583          1      509   
41268  2014-12-16T20:40:00+00:00  0013a20040b5b318          1      502   
15000  2014-12-28T15:30:00+00:00  0013a20040b31583          1      509   
26069  2015-03-20T02:20:00+00:00  0013a20040b516f6          1      507   
49093  2015-03-09T19:55:00+00:00  0013a20040b31571          1      510   
19453  2015-01-15T07:55:00+00:00  0013a20040b31583          1      509   
50144  2015-03-17T03:30:00+00:00  0013a20040b31583          1      509   
15189  2014-12-30T03:50:00+00:00  0013a20040b5b337          1      511   
34161  2015-04-23T23:35:00+00:00  0013a20040b31583          1      509   
1324   2015-02-08T10:55:00+00:00  0013a20040b31571          1      510   

       temp_max  temp_min  temp_avg  light_max  light_min  light_avg  \
16436      38.4      38.4      38.4       97.8       97.8       97.8   
41268      17.7      17.7      17.7       94.3       94.3       94.3   
15000      18.4      18.4      18.4        3.7        3.7        3.7   
26069      17.1      17.1      17.1       91.7       91.7       91.7   
49093      15.5      15.5      15.5        6.9        6.9        6.9   
19453      19.0      19.0      19.0       93.4       93.4       93.4   
50144      25.8      25.8      25.8       95.4       95.4       95.4   
15189      23.9      23.9      23.9       97.8       97.8       97.8   
34161      15.2      15.2      15.2        2.3        2.3        2.3   
1324       17.7      17.7      17.7        1.6        1.6        1.6   

       humidity_min  humidity_max  humidity_avg model   latitude   longitude  \
16436          15.0          15.0          15.0   ENV -37.819904  144.940485   
41268          37.6          37.6          37.6   ENV -37.814610  144.979018   
15000          62.0          62.0          62.0   ENV -37.819904  144.940485   
26069          39.6          39.6          39.6   ENV -37.814922  144.982258   
49093          47.5          47.5          47.5   ENV -37.819712  144.941325   
19453          56.5          56.5          56.5   ENV -37.819904  144.940485   
50144          30.2          30.2          30.2   ENV -37.819904  144.940485   
15189          24.8          24.8          24.8   ENV -37.819500  144.941889   
34161          67.8          67.8          67.8   ENV -37.819904  144.940485   
1324           54.3          54.3          54.3   ENV -37.819712  144.941325   

       elevation           location               rowid  position  \
16436        NaN  Docklands Library  509-20150206072000       NaN   
41268      22.57    Fitzroy Gardens  502-20141216204000       NaN   
15000        NaN  Docklands Library  509-20141228153000       NaN   
26069      38.79    Fitzroy Gardens  507-20150320022000       NaN   
49093       2.74  Docklands Library  510-20150309195500       NaN   
19453        NaN  Docklands Library  509-20150115075500       NaN   
50144       0.03  Docklands Library  509-20150317033000       NaN   
15189        NaN  Docklands Library  511-20141230035000       NaN   
34161       0.03  Docklands Library  509-20150423233500       NaN   
1324         NaN  Docklands Library  510-20150208105500       NaN   

                    geolocation  
16436  -37.8199043, 144.9404851  
41268  -37.8146097, 144.9790177  
15000  -37.8199043, 144.9404851  
26069  -37.8149218, 144.9822582  
49093  -37.8197121, 144.9413253  
19453  -37.8199043, 144.9404851  
50144  -37.8199043, 144.9404851  
15189  -37.8195002, 144.9418888  
34161  -37.8199043, 144.9404851  
1324   -37.8197121, 144.9413253  
In [ ]:
 
In [7]:
### viewing the sensor reading datasets
sensor_reading.head()
Out[7]:
timestamp mac boardtype boardid temp_max temp_min temp_avg light_max light_min light_avg humidity_min humidity_max humidity_avg model latitude longitude elevation location rowid position geolocation
0 2015-01-24T10:45:00+00:00 0013a20040b31571 1 510 19.4 19.4 19.4 0.9 0.9 0.9 52.7 52.7 52.7 ENV -37.819712 144.941325 NaN Docklands Library 510-20150124104500 NaN -37.8197121, 144.9413253
1 2015-01-24T11:15:00+00:00 0013a20040b5b337 1 511 19.7 19.7 19.7 10.6 10.6 10.6 50.2 50.2 50.2 ENV -37.819500 144.941889 NaN Docklands Library 511-20150124111500 NaN -37.8195002, 144.9418888
2 2015-01-24T11:15:00+00:00 0013a20040b31583 1 509 19.7 19.7 19.7 3.1 3.1 3.1 57.9 57.9 57.9 ENV -37.819904 144.940485 NaN Docklands Library 509-20150124111500 NaN -37.8199043, 144.9404851
3 2015-01-24T11:55:00+00:00 0013a20040b31583 1 509 19.7 19.7 19.7 3.1 3.1 3.1 53.7 53.7 53.7 ENV -37.819904 144.940485 NaN Docklands Library 509-20150124115500 NaN -37.8199043, 144.9404851
4 2015-01-24T11:55:00+00:00 0013a20040b31571 1 510 18.7 18.7 18.7 1.0 1.0 1.0 48.6 48.6 48.6 ENV -37.819712 144.941325 NaN Docklands Library 510-20150124115500 NaN -37.8197121, 144.9413253
In [8]:
#checking for the shape
sensor_reading.shape
Out[8]:
(56570, 21)

fetching bus stop dataset¶

In [9]:
bus_stop = API_unlimited(dataset_id_3)
                                geo_point_2d  \
293   -37.78737016259562, 144.96918092237397   
8    -37.837547087144706, 144.98191138368836   
30    -37.82480198399865, 144.97076232908503   
308    -37.818314889062094, 144.956839508202   
289   -37.81105987177411, 144.95869339408225   
109   -37.78077459328419, 144.95138857277198   
45     -37.79443959174042, 144.9295031556217   
243   -37.803343440196116, 144.9693670992385   
273    -37.80282843793904, 144.9479395483275   
135   -37.80111524772101, 144.96674878780823   

                                             geo_shape  prop_id  addresspt1  \
293  {"coordinates": [144.96918092237397, -37.78737...        0    0.000000   
8    {"coordinates": [144.98191138368836, -37.83754...        0   41.441167   
30   {"coordinates": [144.97076232908503, -37.82480...        0   26.353383   
308  {"coordinates": [144.956839508202, -37.8183148...        0   35.877984   
289  {"coordinates": [144.95869339408225, -37.81105...        0   31.787580   
109  {"coordinates": [144.95138857277198, -37.78077...   107426   55.825150   
45   {"coordinates": [144.9295031556217, -37.794439...        0    2.826674   
243  {"coordinates": [144.9693670992385, -37.803343...        0   10.914450   
273  {"coordinates": [144.9479395483275, -37.802828...        0   13.532624   
135  {"coordinates": [144.96674878780823, -37.80111...        0    5.228496   

     addressp_1 asset_clas               asset_type  objectid   str_id  \
293           0    Signage  Sign - Public Transport     39748  1252536   
8            78    Signage  Sign - Public Transport      2922  1248743   
30          200    Signage  Sign - Public Transport     15210  1239404   
308         285    Signage  Sign - Public Transport     44101  1268402   
289         239    Signage  Sign - Public Transport     36816  1252743   
109         306    Signage  Sign - Public Transport      7176  1244570   
45          299    Signage  Sign - Public Transport     23024  1577042   
243         117    Signage  Sign - Public Transport     16758  1240396   
273         123    Signage  Sign - Public Transport     29192  1251207   
135         179    Signage  Sign - Public Transport     17860  1240314   

     addresspt  asset_subt                       model_desc   mcc_id  \
293          0         NaN  Sign - Public Transport 1 Panel  1252536   
8       107419         NaN  Sign - Public Transport 1 Panel  1248743   
30      540076         NaN  Sign - Public Transport 1 Panel  1239404   
308     105393         NaN  Sign - Public Transport 1 Panel  1268402   
289     577288         NaN  Sign - Public Transport 1 Panel  1252743   
109     111342         NaN  Sign - Public Transport 1 Panel  1244570   
45      106320         NaN  Sign - Public Transport 1 Panel  1577042   
243     612989         NaN  Sign - Public Transport 1 Panel  1240396   
273     107985         NaN  Sign - Public Transport 1 Panel  1251207   
135     106109         NaN  Sign - Public Transport 1 Panel  1240314   

     roadseg_id                                       descriptio model_no  
293       22508  Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
8         22245  Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
30        22466  Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
308       20118  Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
289       20026  Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
109           0  Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
45        21693  Sign - Public Transport 1 Panel Bus Stop Type 1     P.16  
243       20556  Sign - Public Transport 1 Panel Bus Stop Type 3     P.16  
273       21015  Sign - Public Transport 1 Panel Bus Stop Type 8     P.16  
135       20530  Sign - Public Transport 1 Panel Bus Stop Type 3     P.16  
In [10]:
#viewing the bust stops data sets
bus_stop.head(3)
Out[10]:
geo_point_2d geo_shape prop_id addresspt1 addressp_1 asset_clas asset_type objectid str_id addresspt asset_subt model_desc mcc_id roadseg_id descriptio model_no
0 -37.80384165792465, 144.93239283833262 {"coordinates": [144.93239283833262, -37.80384... 0 76.819824 357 Signage Sign - Public Transport 355 1235255 570648 NaN Sign - Public Transport 1 Panel 1235255 21673 Sign - Public Transport 1 Panel Bus Stop Type 13 P.16
1 -37.81548699581418, 144.9581794249902 {"coordinates": [144.9581794249902, -37.815486... 0 21.561304 83 Signage Sign - Public Transport 600 1231226 548056 NaN Sign - Public Transport 1 Panel 1231226 20184 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
2 -37.81353897396532, 144.95728334230756 {"coordinates": [144.95728334230756, -37.81353... 0 42.177187 207 Signage Sign - Public Transport 640 1237092 543382 NaN Sign - Public Transport 1 Panel 1237092 20186 Sign - Public Transport 1 Panel Bus Stop Type 8 P.16
In [11]:
#The shape
bus_stop.shape
Out[11]:
(309, 16)

processing my three datasets for analysis¶

pedestrian_hour¶
sensor_reading¶
bus_stop¶
In [12]:
pedestrian_hour.head()
Out[12]:
sensor_name timestamp locationid direction_1 direction_2 total_of_directions location
0 SprFli_T 2023-04-24T21:00:00+00:00 75 36 17 53 -37.81515276, 144.97467661
1 SprFli_T 2023-04-25T00:00:00+00:00 75 28 50 78 -37.81515276, 144.97467661
2 SprFli_T 2023-04-25T01:00:00+00:00 75 63 63 126 -37.81515276, 144.97467661
3 SprFli_T 2023-04-25T02:00:00+00:00 75 85 89 174 -37.81515276, 144.97467661
4 SprFli_T 2023-04-25T08:00:00+00:00 75 365 59 424 -37.81515276, 144.97467661

working on the pedestrian _count dataset¶¶

In [13]:
### trying to get number of pedestrian from each row
pedestrian_counts = {}

# Iterate through each row in the DataFrame
for index, row in pedestrian_hour.iterrows():
    # Extract location ID, location, timestamp, and pedestrian count from the row
    location_id = row['locationid']
    location = row['location']
    timestamp = row['timestamp']
    count = 1  # Since we're counting rows, each row represents one pedestrian
    
    # Create a unique key using location ID and timestamp
    key = (location_id, location, timestamp)
    
    # Increment pedestrian count for the key
    if key in pedestrian_counts:
        pedestrian_counts[key] += count
    else:
        pedestrian_counts[key] = count

# Create lists to store the aggregated data
location_ids = []
locations = []
timestamps = []
counts = []

# Iterate through the pedestrian_counts dictionary and append data to lists
for (location_id, location, timestamp), count in pedestrian_counts.items():
    location_ids.append(location_id)
    locations.append(location)
    timestamps.append(timestamp)
    counts.append(count)

# Create a new DataFrame to store the aggregated pedestrian counts
ped_df = pd.DataFrame({
    'locationid': location_ids,
    'location': locations,
    'timestamp': timestamps,
    'pedestrian_count': counts
})

# Print the new DataFrame
ped_df.head()
Out[13]:
locationid location timestamp pedestrian_count
0 75 -37.81515276, 144.97467661 2023-04-24T21:00:00+00:00 1
1 75 -37.81515276, 144.97467661 2023-04-25T00:00:00+00:00 1
2 75 -37.81515276, 144.97467661 2023-04-25T01:00:00+00:00 1
3 75 -37.81515276, 144.97467661 2023-04-25T02:00:00+00:00 1
4 75 -37.81515276, 144.97467661 2023-04-25T08:00:00+00:00 1
In [14]:
ped_df.shape
Out[14]:
(549967, 4)

working on sensor reading dataset¶

In [15]:
sensor_reading.columns
Out[15]:
Index(['timestamp', 'mac', 'boardtype', 'boardid', 'temp_max', 'temp_min',
       'temp_avg', 'light_max', 'light_min', 'light_avg', 'humidity_min',
       'humidity_max', 'humidity_avg', 'model', 'latitude', 'longitude',
       'elevation', 'location', 'rowid', 'position', 'geolocation'],
      dtype='object')
In [16]:
##selecting the important column for my analysis into a new dataframe sensor_df
sensor_df = sensor_reading[['timestamp','temp_avg','light_avg','humidity_avg', 'geolocation']]
sensor_df.head()
Out[16]:
timestamp temp_avg light_avg humidity_avg geolocation
0 2015-01-24T10:45:00+00:00 19.4 0.9 52.7 -37.8197121, 144.9413253
1 2015-01-24T11:15:00+00:00 19.7 10.6 50.2 -37.8195002, 144.9418888
2 2015-01-24T11:15:00+00:00 19.7 3.1 57.9 -37.8199043, 144.9404851
3 2015-01-24T11:55:00+00:00 19.7 3.1 53.7 -37.8199043, 144.9404851
4 2015-01-24T11:55:00+00:00 18.7 1.0 48.6 -37.8197121, 144.9413253

working on bus stops dataset¶¶

In [17]:
bus_stop.columns
Out[17]:
Index(['geo_point_2d', 'geo_shape', 'prop_id', 'addresspt1', 'addressp_1',
       'asset_clas', 'asset_type', 'objectid', 'str_id', 'addresspt',
       'asset_subt', 'model_desc', 'mcc_id', 'roadseg_id', 'descriptio',
       'model_no'],
      dtype='object')
In [18]:
## selecting important columns for my analysis into a new variable
b_stops = bus_stop[['geo_point_2d','str_id','roadseg_id']]
b_stops.head()
Out[18]:
geo_point_2d str_id roadseg_id
0 -37.80384165792465, 144.93239283833262 1235255 21673
1 -37.81548699581418, 144.9581794249902 1231226 20184
2 -37.81353897396532, 144.95728334230756 1237092 20186
3 -37.82191394843844, 144.95539345270072 1232777 22174
4 -37.83316401267591, 144.97443745130263 1271914 22708
In [19]:
b_stops.shape
Out[19]:
(309, 3)

renaming the variable name of the location column for bustops and sensor dataset to be able to merge it¶

In [20]:
## renaming the variable name of the location column for bustops and sensor dataset to be able to merge it
new_sensor=sensor_df.rename(columns={'geolocation':'location'})
new_bstops=b_stops.rename(columns={'geo_point_2d':'location'})
In [21]:
new_sensor.head()
#new_bstops.head()
Out[21]:
timestamp temp_avg light_avg humidity_avg location
0 2015-01-24T10:45:00+00:00 19.4 0.9 52.7 -37.8197121, 144.9413253
1 2015-01-24T11:15:00+00:00 19.7 10.6 50.2 -37.8195002, 144.9418888
2 2015-01-24T11:15:00+00:00 19.7 3.1 57.9 -37.8199043, 144.9404851
3 2015-01-24T11:55:00+00:00 19.7 3.1 53.7 -37.8199043, 144.9404851
4 2015-01-24T11:55:00+00:00 18.7 1.0 48.6 -37.8197121, 144.9413253
In [ ]:
 

Merging the 3 processed dataset together¶

In [22]:
#spliting of the loaction in ped_df and new_bstops into long and lat
ped_df[['latitude', 'longitude']] = ped_df['location'].str.split(', ', expand=True)
new_bstops[['latitude', 'longitude']]=new_bstops['location'].str.split(', ', expand=True)
new_sensor[['latitude', 'longitude']]=new_sensor['location'].str.split(', ', expand=True)
In [23]:
new_bstops.head()
ped_df.head()
new_sensor.head()
Out[23]:
timestamp temp_avg light_avg humidity_avg location latitude longitude
0 2015-01-24T10:45:00+00:00 19.4 0.9 52.7 -37.8197121, 144.9413253 -37.8197121 144.9413253
1 2015-01-24T11:15:00+00:00 19.7 10.6 50.2 -37.8195002, 144.9418888 -37.8195002 144.9418888
2 2015-01-24T11:15:00+00:00 19.7 3.1 57.9 -37.8199043, 144.9404851 -37.8199043 144.9404851
3 2015-01-24T11:55:00+00:00 19.7 3.1 53.7 -37.8199043, 144.9404851 -37.8199043 144.9404851
4 2015-01-24T11:55:00+00:00 18.7 1.0 48.6 -37.8197121, 144.9413253 -37.8197121 144.9413253
In [24]:
#converting the location for the pedestrian dataset to float
ped_df[['latitude','longitude']] = ped_df[['latitude','longitude']].astype(float)
In [25]:
#df1['latitude_rounded'] = df1['latitude'].apply(lambda x: round(float(x), 7))
ped_df[['latitude', 'longitude']] = ped_df[['latitude', 'longitude']].apply(lambda x: x.apply(lambda y: round(float(y), 7)))
In [26]:
ped_df.head()
Out[26]:
locationid location timestamp pedestrian_count latitude longitude
0 75 -37.81515276, 144.97467661 2023-04-24T21:00:00+00:00 1 -37.815153 144.974677
1 75 -37.81515276, 144.97467661 2023-04-25T00:00:00+00:00 1 -37.815153 144.974677
2 75 -37.81515276, 144.97467661 2023-04-25T01:00:00+00:00 1 -37.815153 144.974677
3 75 -37.81515276, 144.97467661 2023-04-25T02:00:00+00:00 1 -37.815153 144.974677
4 75 -37.81515276, 144.97467661 2023-04-25T08:00:00+00:00 1 -37.815153 144.974677
In [27]:
#converting the location for the busstops dataset to float
new_bstops[['latitude','longitude']] = new_bstops[['latitude','longitude']].astype(float)
new_bstops[['latitude','longitude']]=new_bstops[['latitude','longitude']].apply(lambda x: x.apply(lambda y: round(float(y),7)))
In [28]:
 new_bstops.head()
Out[28]:
location str_id roadseg_id latitude longitude
0 -37.80384165792465, 144.93239283833262 1235255 21673 -37.803842 144.932393
1 -37.81548699581418, 144.9581794249902 1231226 20184 -37.815487 144.958179
2 -37.81353897396532, 144.95728334230756 1237092 20186 -37.813539 144.957283
3 -37.82191394843844, 144.95539345270072 1232777 22174 -37.821914 144.955394
4 -37.83316401267591, 144.97443745130263 1271914 22708 -37.833164 144.974437
In [29]:
#converting the location for the sensor_reading dataset to float
new_sensor[['latitude','longitude']] = new_sensor[['latitude','longitude']].astype(float)
new_sensor[['latitude','longitude']]=new_sensor[['latitude','longitude']].apply(lambda x: x.apply(lambda y: round(float(y),7)))
In [30]:
new_sensor.head()
Out[30]:
timestamp temp_avg light_avg humidity_avg location latitude longitude
0 2015-01-24T10:45:00+00:00 19.4 0.9 52.7 -37.8197121, 144.9413253 -37.819712 144.941325
1 2015-01-24T11:15:00+00:00 19.7 10.6 50.2 -37.8195002, 144.9418888 -37.819500 144.941889
2 2015-01-24T11:15:00+00:00 19.7 3.1 57.9 -37.8199043, 144.9404851 -37.819904 144.940485
3 2015-01-24T11:55:00+00:00 19.7 3.1 53.7 -37.8199043, 144.9404851 -37.819904 144.940485
4 2015-01-24T11:55:00+00:00 18.7 1.0 48.6 -37.8197121, 144.9413253 -37.819712 144.941325
In [31]:
#merging the datasets together
new_df = pd.concat([ped_df, new_sensor,new_bstops])
In [32]:
new_df.head()
Out[32]:
locationid location timestamp pedestrian_count latitude longitude temp_avg light_avg humidity_avg str_id roadseg_id
0 75.0 -37.81515276, 144.97467661 2023-04-24T21:00:00+00:00 1.0 -37.815153 144.974677 NaN NaN NaN NaN NaN
1 75.0 -37.81515276, 144.97467661 2023-04-25T00:00:00+00:00 1.0 -37.815153 144.974677 NaN NaN NaN NaN NaN
2 75.0 -37.81515276, 144.97467661 2023-04-25T01:00:00+00:00 1.0 -37.815153 144.974677 NaN NaN NaN NaN NaN
3 75.0 -37.81515276, 144.97467661 2023-04-25T02:00:00+00:00 1.0 -37.815153 144.974677 NaN NaN NaN NaN NaN
4 75.0 -37.81515276, 144.97467661 2023-04-25T08:00:00+00:00 1.0 -37.815153 144.974677 NaN NaN NaN NaN NaN
In [33]:
#i want to check for the percentage of the missing alue in each column
missing_percent = new_df.isna().sum().sort_values()/len(new_df)*100
missing_percent
Out[33]:
location             0.000000
latitude             0.000000
longitude            0.000000
timestamp            0.050919
locationid           9.372889
pedestrian_count     9.372889
temp_avg            90.678030
light_avg           90.678030
humidity_avg        90.678030
str_id              99.949081
roadseg_id          99.949081
dtype: float64
In [34]:
plt.figure(figsize=(7,3))
new_par = missing_percent[missing_percent >= 5].plot.bar()
plt.gca().set_xlabel("columns")
plt.gca().set_ylabel("counts")
plt.gca().set_title("percentage of missing value")
plt.grid()
plt.show()
No description has been provided for this image

After merging the three datasets together.It resulted to a lot of nan for some of my important column which i believe is not good for my analysis so i decided to treat the datasets one after the other to derive different insight and pattern. before generating a certain amount of samples for clustering and decision making

In [ ]:
 

working with the datasets one after the other¶

making some analysis from each datasets¶

trying to • Visualize the pedestrian count data over time to identify patterns and peaks in pedestrian activity.¶

In [35]:
## from pedestrian dataset 
ped_df.head()
Out[35]:
locationid location timestamp pedestrian_count latitude longitude
0 75 -37.81515276, 144.97467661 2023-04-24T21:00:00+00:00 1 -37.815153 144.974677
1 75 -37.81515276, 144.97467661 2023-04-25T00:00:00+00:00 1 -37.815153 144.974677
2 75 -37.81515276, 144.97467661 2023-04-25T01:00:00+00:00 1 -37.815153 144.974677
3 75 -37.81515276, 144.97467661 2023-04-25T02:00:00+00:00 1 -37.815153 144.974677
4 75 -37.81515276, 144.97467661 2023-04-25T08:00:00+00:00 1 -37.815153 144.974677
In [36]:
ped_df['timestamp'] = pd.to_datetime(ped_df['timestamp'])
In [37]:
ped_df.head()
Out[37]:
locationid location timestamp pedestrian_count latitude longitude
0 75 -37.81515276, 144.97467661 2023-04-24 21:00:00+00:00 1 -37.815153 144.974677
1 75 -37.81515276, 144.97467661 2023-04-25 00:00:00+00:00 1 -37.815153 144.974677
2 75 -37.81515276, 144.97467661 2023-04-25 01:00:00+00:00 1 -37.815153 144.974677
3 75 -37.81515276, 144.97467661 2023-04-25 02:00:00+00:00 1 -37.815153 144.974677
4 75 -37.81515276, 144.97467661 2023-04-25 08:00:00+00:00 1 -37.815153 144.974677

getting the count of pedesrians hourly¶

In [38]:
ped_df['date'] =ped_df['timestamp'].dt.date
#ped_daily = ped_df.groupby('date')['pedestrian_count'].sum()
In [39]:
ped_df['hourly'] = ped_df['timestamp'].dt.floor('H')
In [40]:
ped_daily = ped_df.groupby('date', as_index=False)['pedestrian_count'].sum()
In [41]:
ped_hourly = ped_df.groupby('hourly',as_index=False)['pedestrian_count'].sum()
In [42]:
ped_hourly
Out[42]:
hourly pedestrian_count
0 2023-03-31 13:00:00+00:00 71
1 2023-03-31 14:00:00+00:00 69
2 2023-03-31 15:00:00+00:00 68
3 2023-03-31 16:00:00+00:00 63
4 2023-03-31 17:00:00+00:00 64
... ... ...
7510 2024-03-18 10:00:00+00:00 81
7511 2024-03-18 11:00:00+00:00 80
7512 2024-03-18 12:00:00+00:00 79
7513 2024-03-18 13:00:00+00:00 78
7514 2024-03-18 14:00:00+00:00 72

7515 rows × 2 columns

In [43]:
ped_daily
Out[43]:
date pedestrian_count
0 2023-03-31 751
1 2023-04-01 1592
2 2023-04-02 1651
3 2023-04-03 1681
4 2023-04-04 1665
... ... ...
341 2024-03-14 1930
342 2024-03-15 1935
343 2024-03-16 1945
344 2024-03-17 1904
345 2024-03-18 1216

346 rows × 2 columns

plotting the time series of pedestrian count for daily and hourly¶

In [44]:
plt.figure(figsize=(15,6))

# Subplot 1: Pedestrian counts over time (daily)
plt.subplot(1, 2, 1)  # 1 row, 2 columns, 1st subplot
plt.plot(ped_daily['date'], ped_daily['pedestrian_count'], marker="o", linestyle="-")
plt.title('Pedestrian Count Over Time (Daily)')
plt.xlabel('Date')
plt.ylabel('Pedestrian Count')
plt.grid(True)
plt.xticks(rotation=45)

# Ensuring tight layout for the first subplot
plt.tight_layout()

# Subplot 2: Pedestrian counts over time (hourly)
plt.subplot(1, 2, 2)  # 1 row, 2 columns, 2nd subplot
plt.plot(ped_hourly['hourly'], ped_hourly['pedestrian_count'], marker='o', linestyle="-")
plt.title('Pedestrian Count Over Time (Hourly)')
plt.xlabel('Hour')
plt.ylabel('Pedestrian Count')
plt.grid(True)
plt.xticks(rotation=45)

# Ensuring tight layout takes into account the second subplot as well
plt.tight_layout()

# Display the figure with both subplots
plt.show()
No description has been provided for this image

Peak Activity: A notable pattern observed in the data is the occurrence of days with significantly high pedestrian counts, ranging between 1750 and 2000. These peaks suggest periods of increased pedestrian activity that could be attributed to specific factors such as weekday rush hours, public events, or favorable weather conditions.

Low Activity: Conversely, the data also revealed days with markedly low pedestrian counts, approximately around 250. These troughs in pedestrian activity may indicate periods of reduced foot traffic, possibly due to adverse weather conditions, weekdays with no significant events, or other deterrents to outdoor activity.

In [45]:
ped_df['days_of_week'] = ped_df['timestamp'].dt.day_name()
In [46]:
count_by_day = ped_df.groupby('days_of_week',as_index=False)['pedestrian_count'].sum()

trying to get the day of the week with highest pedestrian traffic¶

In [47]:
count_by_day
Out[47]:
days_of_week pedestrian_count
0 Friday 75574
1 Monday 79832
2 Saturday 80036
3 Sunday 78689
4 Thursday 76348
5 Tuesday 79470
6 Wednesday 80027
In [48]:
# putting the ordeer of the day in a variable
day_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
ped_df['days_of_week'] = pd.Categorical(ped_df['timestamp'].dt.day_name(), categories=day_order, ordered=True)

# Group by 'days_of_week' and sum 'pedestrian_count'
count_by_day = ped_df.groupby('days_of_week', as_index=False)['pedestrian_count'].sum()

# Plotting
plt.figure(figsize=(8, 5))

# When plotting, specify 'x' and 'y' explicitly to ensure correct columns are used
plt.bar(count_by_day['days_of_week'], count_by_day['pedestrian_count'], color='skyblue')
plt.ylim(50000, 100000)
plt.yticks(np.arange(50000, 105000, 5000))
plt.title('Total Pedestrian Count by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Total Pedestrian Count')  # Changed "Average" to "Total" since we're summing counts
plt.xticks(rotation=45)
plt.grid(axis='y', linestyle='--')
plt.show()
No description has been provided for this image

Peak Pedestrian Days: Monday, Wednesday, and Saturday were identified as the days with the highest pedestrian traffic. These days exhibit significantly higher pedestrian volumes compared to other days of the week, indicating a pattern of increased activity that could be attributed to specific weekly events, market days, or leisure activities typical of these days.

Lowest Pedestrian Traffic: Friday was observed to have the lowest pedestrian count among all days of the week. This reduction in pedestrian activity could reflect a weekly variation in social or commercial patterns, such as alternative entertainment options, travel patterns, or shopping

trying to get the significant concentration of pedestrian traffic in specific area¶

In [49]:
location_totals = ped_df.groupby('location')['pedestrian_count'].sum().reset_index()
In [50]:
location_totals
Out[50]:
location pedestrian_count
0 -37.79432415, 144.92973378 6682
1 -37.79453803, 144.93036194 6738
2 -37.79690473, 144.96440378 5007
3 -37.79808192, 144.96721013 6854
4 -37.79844526, 144.96411782 4293
... ... ...
87 -37.8204637, 144.94126826 1780
88 -37.82129925, 144.96879309 7297
89 -37.82293543, 144.9471751 7190
90 -37.82401776, 144.95604426 4414
91 -37.82590962, 144.96185972 1546

92 rows × 2 columns

In [51]:
# Sort by total counts (or use 'location_averages' if working with averages)
location_totals_sort = location_totals.sort_values(by='pedestrian_count', ascending=False)
top_15_location = location_totals_sort.head(15)
In [52]:
plt.figure(figsize=(10, 6))
sns.barplot(x='pedestrian_count', y='location', data=top_15_location, palette='viridis')

plt.title('Total Pedestrian Traffic by Location')
plt.xlabel('Total Pedestrian Count')
plt.ylabel('Location')
plt.grid(axis='x', linestyle='--')
plt.show()
No description has been provided for this image

High Pedestrian Traffic Concentration: The graph reveals a significant concentration of pedestrian traffic in specific areas, with the top location registering the highest count. This suggests a strong preference or need for pedestrian access in these areas, potentially driven by commercial, recreational, or transit-related activities.

Variability Among Top Locations: There is noticeable variability in pedestrian counts among the top 15 locations. While the top locations show exceptionally high pedestrian traffic, there is a gradual decrease as we move down the list, indicating a steep drop-off in foot traffic outside of the most frequented areas.

Location Characteristics: The busiest locations likely share common characteristics that attract high pedestrian volumes, such as proximity to public transportation hubs, commercial districts, or key urban attractions

mapping-------out the peestrian traffic in different loactions¶

In [53]:
# Prepare data for the HeatMap
data = ped_df[['latitude', 'longitude', 'pedestrian_count']].values.tolist()
# Initialize your map around the center of your dataset
map_center = [ped_df['latitude'].mean(), ped_df['longitude'].mean()]
m = folium.Map(location=map_center, zoom_start=13)

# Add a HeatMap layer to the map
HeatMap(data).add_to(m)

# Display the map
m
Out[53]:
Make this Notebook Trusted to load map: File -> Trust Notebook

The heatmap actually display the key insights, such as:

High Traffic Areas: which shows Locations with the highest pedestrian counts, which correlate mostly with commercial zones, tourist attractions, public transport hubs, or other points of interest, suggesting areas of high economic activity or social gathering.

analysis on sensory data¶

In [54]:
new_sensor.head()
Out[54]:
timestamp temp_avg light_avg humidity_avg location latitude longitude
0 2015-01-24T10:45:00+00:00 19.4 0.9 52.7 -37.8197121, 144.9413253 -37.819712 144.941325
1 2015-01-24T11:15:00+00:00 19.7 10.6 50.2 -37.8195002, 144.9418888 -37.819500 144.941889
2 2015-01-24T11:15:00+00:00 19.7 3.1 57.9 -37.8199043, 144.9404851 -37.819904 144.940485
3 2015-01-24T11:55:00+00:00 19.7 3.1 53.7 -37.8199043, 144.9404851 -37.819904 144.940485
4 2015-01-24T11:55:00+00:00 18.7 1.0 48.6 -37.8197121, 144.9413253 -37.819712 144.941325
In [55]:
new_sensor.shape
Out[55]:
(56570, 7)
In [56]:
new_sensor['timestamp'] = pd.to_datetime(ped_df['timestamp'])
new_sensor['date'] =new_sensor['timestamp'].dt.date
In [57]:
sensor_daily_temp = new_sensor.groupby('date',as_index=False)['temp_avg'].sum()
sensor_daily_light = new_sensor.groupby('date',as_index=False)['light_avg'].sum()
sensor_daily_humidity = new_sensor.groupby('date',as_index=False)['humidity_avg'].sum()
In [58]:
sensor_daily_light
Out[58]:
date light_avg
0 2023-04-24 2352.9
1 2023-04-25 16790.6
2 2023-04-26 18654.6
3 2023-04-27 2086.7
4 2023-04-28 14384.0
... ... ...
120 2023-09-05 28948.9
121 2023-09-06 30754.3
122 2023-09-07 24099.7
123 2023-09-08 22530.0
124 2023-09-09 11584.9

125 rows × 2 columns

visualizing the pattern of temperature light and humidity over time¶

In [59]:
plt.figure(figsize=(15, 5))
plt.subplot(1,3,1)
# Plotting temperature as a scatter plot
plt.scatter(sensor_daily_temp['date'], sensor_daily_temp['temp_avg'], label='Temperature', color='red', alpha=0.5)
plt.xlabel('Timestamp')
plt.ylabel('Average Temperature')
plt.title('Temperature Over Time')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()

# Optionally, plot light and humidity on separate figures or on the same figure with different colors
plt.subplot(1,3,2)
plt.scatter(sensor_daily_light['date'], sensor_daily_light['light_avg'], label='Light', color='blue', alpha=0.5)
plt.xlabel('Timestamp')
plt.ylabel('Average light')
plt.title('light Over Time')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()

plt.subplot(1,3,3)
plt.scatter(sensor_daily_humidity['date'], sensor_daily_humidity['humidity_avg'], label='Humidity', color='green', alpha=0.5)
plt.xlabel('Timestamp')
plt.ylabel('Average humidity')
plt.title('humidity Over Time')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()  # Adjust layout to make room for the rotated x-axis labels
plt.show()
No description has been provided for this image

tryin to do the analysis of the time series data for temperature, light, and humidity reveals a pattern of environmental conditions¶

In [60]:
plt.figure(figsize=(15, 5))
plt.subplot(1,3,1)
# Plotting temperature as a scatter plot
plt.scatter(new_sensor['timestamp'], new_sensor['temp_avg'], label='Temperature', color='red', alpha=0.5)
plt.xlabel('Timestamp')
plt.ylabel('Average Temperature')
plt.title('Temperature Over Time')
plt.xticks(rotation=45)

# Optionally, plot light and humidity on separate figures or on the same figure with different colors
plt.subplot(1,3,2)
plt.scatter(new_sensor['timestamp'], new_sensor['light_avg'], label='Light', color='blue', alpha=0.5)
plt.xlabel('Timestamp')
plt.ylabel('Average light')
plt.title('light Over Time')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()

plt.subplot(1,3,3)
plt.scatter(new_sensor['timestamp'], new_sensor['humidity_avg'], label='Humidity', color='green', alpha=0.5)
plt.xlabel('Timestamp')
plt.ylabel('Average humidity')
plt.title('humidity Over Time')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()  # Adjust layout to make room for the rotated x-axis labels
plt.show()
No description has been provided for this image

Temperature Analysis: The temperature readings predominantly fluctuate within the range of 12°C to 25°C. This indicates a moderate climate over the observed period, with temperatures maintaining a relatively mild profile. The consistency within this temperature range suggests a stable environment, likely with minimal extreme weather events during the monitoring period.

Light Analysis: Light levels exhibit a pronounced pattern of highs and lows, corresponding closely with the diurnal cycle. This pattern underscores the natural variation in light intensity associated with the time of day, with peak light levels occurring during daylight hours and minimal levels during nighttime. The presence of both high and low extremes on the same occasions indicates a clear and expected transition between day and night over the period of observation.

Humidity Analysis: Humidity readings predominantly span from 20% to 60%, indicating a range that encompasses dry to moderately humid conditions. This variability in humidity levels could be influenced by several factors, including geographic location, prevailing weather patterns, and seasonal changes. The observed range suggests an environment where humidity levels are neither excessively dry nor overly humid, maintaining conditions conducive to a wide variety of natural and human activities.

showing the temperature heatmap on the map¶

In [61]:
import folium
from folium.plugins import HeatMap

# Create a map centered around the average latitude and longitude
map_ = folium.Map(location=[new_sensor['latitude'].mean(), new_sensor['longitude'].mean()], zoom_start=5)

# Add a heatmap layer
HeatMap(new_sensor[['latitude', 'longitude', 'temp_avg']], radius=20).add_to(map_)

map_.save('heatmap.html')
map_
Out[61]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [ ]:
 

trying to integrate sensor reading data with pedestrian count data to analyze how environmental factors influence pedestrian behavior.¶

In [62]:
new_sensor.head()
Out[62]:
timestamp temp_avg light_avg humidity_avg location latitude longitude date
0 2023-04-24 21:00:00+00:00 19.4 0.9 52.7 -37.8197121, 144.9413253 -37.819712 144.941325 2023-04-24
1 2023-04-25 00:00:00+00:00 19.7 10.6 50.2 -37.8195002, 144.9418888 -37.819500 144.941889 2023-04-25
2 2023-04-25 01:00:00+00:00 19.7 3.1 57.9 -37.8199043, 144.9404851 -37.819904 144.940485 2023-04-25
3 2023-04-25 02:00:00+00:00 19.7 3.1 53.7 -37.8199043, 144.9404851 -37.819904 144.940485 2023-04-25
4 2023-04-25 08:00:00+00:00 18.7 1.0 48.6 -37.8197121, 144.9413253 -37.819712 144.941325 2023-04-25
In [63]:
ped_df.head()
Out[63]:
locationid location timestamp pedestrian_count latitude longitude date hourly days_of_week
0 75 -37.81515276, 144.97467661 2023-04-24 21:00:00+00:00 1 -37.815153 144.974677 2023-04-24 2023-04-24 21:00:00+00:00 Monday
1 75 -37.81515276, 144.97467661 2023-04-25 00:00:00+00:00 1 -37.815153 144.974677 2023-04-25 2023-04-25 00:00:00+00:00 Tuesday
2 75 -37.81515276, 144.97467661 2023-04-25 01:00:00+00:00 1 -37.815153 144.974677 2023-04-25 2023-04-25 01:00:00+00:00 Tuesday
3 75 -37.81515276, 144.97467661 2023-04-25 02:00:00+00:00 1 -37.815153 144.974677 2023-04-25 2023-04-25 02:00:00+00:00 Tuesday
4 75 -37.81515276, 144.97467661 2023-04-25 08:00:00+00:00 1 -37.815153 144.974677 2023-04-25 2023-04-25 08:00:00+00:00 Tuesday
In [64]:
ped_df.shape
Out[64]:
(549967, 9)
In [65]:
new_ped = ped_df.iloc[:,:-2]
new_ped.head()
Out[65]:
locationid location timestamp pedestrian_count latitude longitude date
0 75 -37.81515276, 144.97467661 2023-04-24 21:00:00+00:00 1 -37.815153 144.974677 2023-04-24
1 75 -37.81515276, 144.97467661 2023-04-25 00:00:00+00:00 1 -37.815153 144.974677 2023-04-25
2 75 -37.81515276, 144.97467661 2023-04-25 01:00:00+00:00 1 -37.815153 144.974677 2023-04-25
3 75 -37.81515276, 144.97467661 2023-04-25 02:00:00+00:00 1 -37.815153 144.974677 2023-04-25
4 75 -37.81515276, 144.97467661 2023-04-25 08:00:00+00:00 1 -37.815153 144.974677 2023-04-25
In [66]:
new_ped['date'] = pd.to_datetime(new_ped['date'])
u_years = new_ped['date'].dt.year.unique()
u_years
Out[66]:
array([2023, 2024])
In [67]:
new_sensor['date'] = pd.to_datetime(new_sensor['date'])
n_years = new_sensor['date'].dt.year.unique()
n_years
Out[67]:
array([2023])
In [68]:
ped_data_2023 = new_ped[new_ped['date'].dt.year==2023]
ped_data_2023.head()
Out[68]:
locationid location timestamp pedestrian_count latitude longitude date
0 75 -37.81515276, 144.97467661 2023-04-24 21:00:00+00:00 1 -37.815153 144.974677 2023-04-24
1 75 -37.81515276, 144.97467661 2023-04-25 00:00:00+00:00 1 -37.815153 144.974677 2023-04-25
2 75 -37.81515276, 144.97467661 2023-04-25 01:00:00+00:00 1 -37.815153 144.974677 2023-04-25
3 75 -37.81515276, 144.97467661 2023-04-25 02:00:00+00:00 1 -37.815153 144.974677 2023-04-25
4 75 -37.81515276, 144.97467661 2023-04-25 08:00:00+00:00 1 -37.815153 144.974677 2023-04-25
In [69]:
ped_data_2023.shape
Out[69]:
(402348, 7)
In [ ]:
 
In [70]:
# Group by 'date', and aggregate
new_df = ped_data_2023.groupby('date').agg({
    'timestamp': 'first',  # choosing 'first' or 
    'location': 'first',   # Assuming location is consistent within each day
    'latitude': 'first',   # Taking the first latitude for the day
    'longitude': 'first',  # Taking the first longitude for the day
    'pedestrian_count': 'sum'  # Summing up pedestrian counts for the day
}).reset_index()
In [71]:
new_df=new_df.sort_values(by='date')
new_df.head()
Out[71]:
date timestamp location latitude longitude pedestrian_count
0 2023-03-31 2023-03-31 18:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 751
1 2023-04-01 2023-04-01 01:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 1592
2 2023-04-02 2023-04-02 00:00:00+00:00 -37.81349441, 144.96515323 -37.813494 144.965153 1651
3 2023-04-03 2023-04-03 19:00:00+00:00 -37.81349441, 144.96515323 -37.813494 144.965153 1681
4 2023-04-04 2023-04-04 14:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 1665
In [72]:
new_df['date'] = pd.to_datetime(new_df['date'])

# Find the first (earliest) date
first_date = new_df['date'].min()
print(first_date)

# Find the last (latest) date
last_date = new_df['date'].max()
print(last_date)
2023-03-31 00:00:00
2023-12-31 00:00:00
In [73]:
new_bstops.head()
Out[73]:
location str_id roadseg_id latitude longitude
0 -37.80384165792465, 144.93239283833262 1235255 21673 -37.803842 144.932393
1 -37.81548699581418, 144.9581794249902 1231226 20184 -37.815487 144.958179
2 -37.81353897396532, 144.95728334230756 1237092 20186 -37.813539 144.957283
3 -37.82191394843844, 144.95539345270072 1232777 22174 -37.821914 144.955394
4 -37.83316401267591, 144.97443745130263 1271914 22708 -37.833164 144.974437
In [ ]:
 
In [74]:
new_sensor.head()
Out[74]:
timestamp temp_avg light_avg humidity_avg location latitude longitude date
0 2023-04-24 21:00:00+00:00 19.4 0.9 52.7 -37.8197121, 144.9413253 -37.819712 144.941325 2023-04-24
1 2023-04-25 00:00:00+00:00 19.7 10.6 50.2 -37.8195002, 144.9418888 -37.819500 144.941889 2023-04-25
2 2023-04-25 01:00:00+00:00 19.7 3.1 57.9 -37.8199043, 144.9404851 -37.819904 144.940485 2023-04-25
3 2023-04-25 02:00:00+00:00 19.7 3.1 53.7 -37.8199043, 144.9404851 -37.819904 144.940485 2023-04-25
4 2023-04-25 08:00:00+00:00 18.7 1.0 48.6 -37.8197121, 144.9413253 -37.819712 144.941325 2023-04-25
In [75]:
sensor_df = new_sensor.groupby('date').agg({
    'timestamp': 'first',  # choosing 'first' or 'last'
    'location': 'first',   # Assuming location is consistent within each day
    'latitude': 'first',   # Taking the first latitude for the day
    'longitude': 'first',  # Taking the first longitude for the day
    'temp_avg': 'sum',  # Summing up temp_avg counts for the day
    'light_avg': 'sum', #  Summing up light_avg counts for the day
    'humidity_avg' : 'sum' #  Summing up humidity_avg counts for the day
}).reset_index()
In [76]:
sensor_df.head
Out[76]:
<bound method NDFrame.head of           date                 timestamp                  location   latitude  \
0   2023-04-24 2023-04-24 21:00:00+00:00  -37.8197121, 144.9413253 -37.819712   
1   2023-04-25 2023-04-25 00:00:00+00:00  -37.8195002, 144.9418888 -37.819500   
2   2023-04-26 2023-04-26 03:00:00+00:00  -37.8197121, 144.9413253 -37.819712   
3   2023-04-27 2023-04-27 00:00:00+00:00  -37.8195002, 144.9418888 -37.819500   
4   2023-04-28 2023-04-28 14:00:00+00:00   -37.813073, 144.9804061 -37.813073   
..         ...                       ...                       ...        ...   
120 2023-09-05 2023-09-05 00:00:00+00:00  -37.8199043, 144.9404851 -37.819904   
121 2023-09-06 2023-09-06 19:00:00+00:00  -37.8203537, 144.9404816 -37.820354   
122 2023-09-07 2023-09-07 00:00:00+00:00  -37.8146097, 144.9790177 -37.814610   
123 2023-09-08 2023-09-08 14:00:00+00:00  -37.8199043, 144.9404851 -37.819904   
124 2023-09-09 2023-09-09 02:00:00+00:00  -37.8199043, 144.9404851 -37.819904   

      longitude  temp_avg  light_avg  humidity_avg  
0    144.941325     742.8     2352.9        1828.6  
1    144.941889    5233.4    16790.6       13930.2  
2    144.941325    6052.5    18654.6       16527.7  
3    144.941889    1036.5     2086.7        2930.3  
4    144.980406    5336.0    14384.0       13506.6  
..          ...       ...        ...           ...  
120  144.940485   10453.4    28948.9       26069.5  
121  144.940482   11350.8    30754.3       27892.5  
122  144.979018    9841.6    24099.7       30575.6  
123  144.940485    8943.3    22530.0       24345.1  
124  144.940485    4849.6    11584.9       10295.3  

[125 rows x 8 columns]>
In [77]:
sensor_df=sensor_df.sort_values(by='date')
sensor_df.head()
Out[77]:
date timestamp location latitude longitude temp_avg light_avg humidity_avg
0 2023-04-24 2023-04-24 21:00:00+00:00 -37.8197121, 144.9413253 -37.819712 144.941325 742.8 2352.9 1828.6
1 2023-04-25 2023-04-25 00:00:00+00:00 -37.8195002, 144.9418888 -37.819500 144.941889 5233.4 16790.6 13930.2
2 2023-04-26 2023-04-26 03:00:00+00:00 -37.8197121, 144.9413253 -37.819712 144.941325 6052.5 18654.6 16527.7
3 2023-04-27 2023-04-27 00:00:00+00:00 -37.8195002, 144.9418888 -37.819500 144.941889 1036.5 2086.7 2930.3
4 2023-04-28 2023-04-28 14:00:00+00:00 -37.813073, 144.9804061 -37.813073 144.980406 5336.0 14384.0 13506.6
In [78]:
sensor_df.shape
Out[78]:
(125, 8)
In [79]:
#checking the date range
sensor_df['date'] = pd.to_datetime(new_df['date'])

# Find the first (earliest) date
first_date = sensor_df['date'].min()
print(first_date)

# Find the last (latest) date
last_date = sensor_df['date'].max()
print(last_date)
2023-03-31 00:00:00
2023-08-09 00:00:00
In [80]:
# Ensure the 'date' column is in datetime format
new_df['date'] = pd.to_datetime(new_df['date'])

# Define your date range
start_date = '2023-03-31'
end_date = '2023-08-09'

# Filter the DataFrame for the date range
filtered_ped = new_df[(new_df['date'] >= start_date) & (new_df['date'] <= end_date)]

filtered_ped.head()
Out[80]:
date timestamp location latitude longitude pedestrian_count
0 2023-03-31 2023-03-31 18:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 751
1 2023-04-01 2023-04-01 01:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 1592
2 2023-04-02 2023-04-02 00:00:00+00:00 -37.81349441, 144.96515323 -37.813494 144.965153 1651
3 2023-04-03 2023-04-03 19:00:00+00:00 -37.81349441, 144.96515323 -37.813494 144.965153 1681
4 2023-04-04 2023-04-04 14:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 1665
In [81]:
filtered_ped.shape
Out[81]:
(125, 6)
In [82]:
# Joining on latitude and longitude
merged_df = pd.merge(filtered_ped, sensor_df, on=['date'], how='inner')
In [83]:
merged_df.head()
Out[83]:
date timestamp_x location_x latitude_x longitude_x pedestrian_count timestamp_y location_y latitude_y longitude_y temp_avg light_avg humidity_avg
0 2023-03-31 2023-03-31 18:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 751 2023-04-24 21:00:00+00:00 -37.8197121, 144.9413253 -37.819712 144.941325 742.8 2352.9 1828.6
1 2023-04-01 2023-04-01 01:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 1592 2023-04-25 00:00:00+00:00 -37.8195002, 144.9418888 -37.819500 144.941889 5233.4 16790.6 13930.2
2 2023-04-02 2023-04-02 00:00:00+00:00 -37.81349441, 144.96515323 -37.813494 144.965153 1651 2023-04-26 03:00:00+00:00 -37.8197121, 144.9413253 -37.819712 144.941325 6052.5 18654.6 16527.7
3 2023-04-03 2023-04-03 19:00:00+00:00 -37.81349441, 144.96515323 -37.813494 144.965153 1681 2023-04-27 00:00:00+00:00 -37.8195002, 144.9418888 -37.819500 144.941889 1036.5 2086.7 2930.3
4 2023-04-04 2023-04-04 14:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 1665 2023-04-28 14:00:00+00:00 -37.813073, 144.9804061 -37.813073 144.980406 5336.0 14384.0 13506.6
In [84]:
#checking for null
missing_per = merged_df.isna().sum().sort_values()
missing_per
Out[84]:
date                0
timestamp_x         0
location_x          0
latitude_x          0
longitude_x         0
pedestrian_count    0
timestamp_y         0
location_y          0
latitude_y          0
longitude_y         0
temp_avg            0
light_avg           0
humidity_avg        0
dtype: int64

there is no missing value the merged_df dataframe

finding the relationship between temp_avg light_ag and humidity¶

In [85]:
#Checking for the distributions in all the important variables
merged_df[['pedestrian_count','temp_avg','light_avg','humidity_avg']].hist()
plt.tight_layout()
plt.show()
No description has been provided for this image

The histogram shows that the pedestrian_count has skewed distributio and is skewed to the left while, the temp_avg is a unimodal distribrution and is kind of skewed to the left, light_avg is a unimodal distribution and is also skewed to the left and the humidity_avg is also a unimodal distribution and is also skewed to the left

checking for outliers¶

In [86]:
import matplotlib.pyplot as plt

# Assuming merged_df is your DataFrame and it's already defined

fig, axs = plt.subplots(2, 2, figsize=(8, 5))  # Creates a grid of 2x2 for subplots

# Boxplot for 'pedestrian_count'
axs[0, 0].boxplot(merged_df['pedestrian_count'])
axs[0, 0].set_title('Pedestrian Count')

# Boxplot for 'humidity_avg'
axs[0, 1].boxplot(merged_df['humidity_avg'])
axs[0, 1].set_title('Humidity Avg')

# Boxplot for 'temp_avg'
axs[1, 0].boxplot(merged_df['temp_avg'])
axs[1, 0].set_title('Temp Avg')

# Boxplot for 'light_avg'
axs[1, 1].boxplot(merged_df['light_avg'])
axs[1, 1].set_title('Light Avg')

# Adding some layout improvements
plt.tight_layout()
plt.show()
No description has been provided for this image

trying to apply some transformation on the selected variables¶

removing outliers¶

In [87]:
import pandas as pd

def remove_outliers(df, column_name, multiplier=2):
    
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1
    
    # Define the condition for outliers
    outliers_condition = (df[column_name] < (Q1 - multiplier * IQR)) | (df[column_name] > (Q3 + multiplier * IQR))
    
    # Filter out outliers
    filtered_df = df[~outliers_condition]
    
    return filtered_df
In [88]:
# Apply the function to remove outliers from columns
filtered_ped = remove_outliers(merged_df, 'pedestrian_count', multiplier=0.1)
filtered_temp = remove_outliers(merged_df, 'temp_avg', multiplier=0.7)
filtered_light = remove_outliers(merged_df, 'light_avg', multiplier=0.7)
filtered_humidity = remove_outliers(merged_df, 'humidity_avg', multiplier=0.7)

# Now, plot the boxplot for 'light_avg' without outliers
fig, axs = plt.subplots(2, 2, figsize=(8, 5))  # 2x2 subplot grid

axs[0, 0].boxplot(filtered_ped['pedestrian_count'])
axs[0, 0].set_title('Pedestrian Count (No Outliers)')

axs[0, 1].boxplot(filtered_temp['temp_avg'])
axs[0, 1].set_title('Temp Avg (No Outliers)')

axs[1, 0].boxplot(filtered_light['light_avg'])
axs[1, 0].set_title('Light Avg (No Outliers)')

axs[1, 1].boxplot(filtered_humidity['humidity_avg'])
axs[1, 1].set_title('Humidity Avg (No Outliers)')

plt.tight_layout()
plt.show()
No description has been provided for this image
In [89]:
# Creating a new DataFrame from the cleaned series

# Extract the cleaned columns as Series
filtered_ped_series = filtered_ped['pedestrian_count']
filtered_temp_series = filtered_temp['temp_avg']
filtered_light_series = filtered_light['light_avg']
filtered_humidity_series = filtered_humidity['humidity_avg']

# Combine these Series into a new DataFrame
cleaned_df = pd.DataFrame({
    'pedestrian_count': filtered_ped_series,
    'temp_avg': filtered_temp_series,
    'light_avg': filtered_light_series,
    'humidity_avg': filtered_humidity_series
}, index=merged_df.index)

#Drop rows with NaN values that result from the removal process
cleaned_df = cleaned_df.dropna()

cleaned_df
Out[89]:
pedestrian_count temp_avg light_avg humidity_avg
1 1592.0 5233.4 16790.6 13930.2
2 1651.0 6052.5 18654.6 16527.7
4 1665.0 5336.0 14384.0 13506.6
5 1694.0 8398.9 25747.3 20227.6
7 1666.0 12468.8 26961.4 28442.9
... ... ... ... ...
97 1613.0 8711.8 27110.1 19422.6
98 1602.0 5223.1 13632.6 14095.2
102 1602.0 11039.2 26275.1 18190.8
103 1589.0 9485.8 25059.0 21473.0
104 1617.0 6250.9 17057.2 19049.8

64 rows × 4 columns

applying standardization to make features comparableby removing the effect scales¶

In [90]:
#applying standardization to make features comparableby removing the effect scales
column_to_transform = ['pedestrian_count','temp_avg','light_avg','humidity_avg']
scaler = StandardScaler()
#fitting the scaler on the variables
after_transform = cleaned_df.copy()
after_transform[column_to_transform] = scaler.fit_transform(cleaned_df[column_to_transform])
after_transform.head()
Out[90]:
pedestrian_count temp_avg light_avg humidity_avg
1 -1.717439 -2.129084 -1.581865 -2.051454
2 -0.594487 -1.677102 -1.177389 -1.531559
4 -0.328023 -2.072469 -2.104083 -2.136239
5 0.223936 -0.382352 0.361683 -0.791016
7 -0.308990 1.863431 0.625135 0.853295
In [ ]:
 
In [91]:
#Checking for the distributions in all the important variables
after_transform[['pedestrian_count','temp_avg','light_avg','humidity_avg']].hist()
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
 

checking for the correlations¶

In [92]:
# Calculate correlation matrix
correlation_matrix = after_transform[['pedestrian_count', 'temp_avg', 'light_avg', 'humidity_avg']].corr()

# Print correlation matrix
print(correlation_matrix)
                  pedestrian_count  temp_avg  light_avg  humidity_avg
pedestrian_count          1.000000  0.183063   0.185144      0.245412
temp_avg                  0.183063  1.000000   0.843165      0.768014
light_avg                 0.185144  0.843165   1.000000      0.650894
humidity_avg              0.245412  0.768014   0.650894      1.000000

plotting of the correlation¶

In [93]:
import seaborn as sns
# Plot correlation matrix heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
No description has been provided for this image

Pedestrian Count and Environmental Factors: The correlation coefficients between pedestrian count and environmental factors are all positive, indicating that an increase in these environmental variables is associated with an increase in pedestrian count. However, all these correlations are under 0.3, suggesting that while there is a positive relationship, it is relatively weak. The strongest correlation with pedestrian count is humidity (0.245412), followed by light (0.185144), and temperature (0.183063). This suggests that humidity might have a slightly more significant influence on pedestrian count compared to light and temperature, but overall, environmental factors as represented may not be the primary drivers of pedestrian traffic.

Temperature, Light, and Humidity Interrelationships: The environmental factors show stronger correlations with each other than with pedestrian count. Temperature and light have a very high correlation (0.843165), indicating that warmer conditions are often associated with brighter/lighter conditions, which aligns with natural expectations. Temperature and humidity are also strongly correlated (0.768014), suggesting that warmer temperatures tend to coincide with higher humidity levels. Light and humidity have a moderate to strong correlation (0.650894), indicating that brighter conditions might also be associated with higher humidity, though this relationship is not as strong as the other

using scatter plot to capture the pattern¶

In [94]:
plt.figure(figsize=(15, 5))
plt.subplot(1,3,1)

plt.scatter(after_transform['pedestrian_count'], after_transform['temp_avg'], color='red')
# Adding title and labels
plt.title(' Pedestrian Count vs. Temperature Average')
plt.xlabel('Pedestrian Count')
plt.ylabel('Temperature Average')

plt.subplot(1,3,2)
plt.scatter(after_transform['pedestrian_count'],after_transform['light_avg'])
plt.title(' Pedestrian Count vs. light Average')
plt.xlabel('Pedestrian Count')
plt.ylabel('light Average')


plt.subplot(1,3,3)
plt.scatter(after_transform['pedestrian_count'],after_transform['humidity_avg'], color='green')
plt.title(' Pedestrian Count vs. humidity Average')
plt.xlabel('Pedestrian Count')
plt.ylabel('humidity Average')

# Show plot
plt.show()
No description has been provided for this image

Pedestrian Count vs. Temperature: There doesn't appear to be a clear linear relationship. Instead, the distribution seems relatively flat, suggesting that temperature alone may not strongly predict pedestrian counts. Pedestrian Count vs. Light: This plot also does not show a strong linear trend, indicating the relationship, if present, may be complex or influenced by other factors. Pedestrian Count vs. Humidity: Similarly, this relationship does not display a strong linear pattern, suggesting other non-linear dynamics could be at play.

In [ ]:
 

trying explore how different environmental conditions affect the number of pedestrians. For instance, you could use regression analysis to determine if colder temperatures or lower light levels lead to fewer people walking in a particular area.¶

using the GAM¶

In [95]:
from pygam import LinearGAM, s
from sklearn.model_selection import train_test_split

x = after_transform[['temp_avg', 'light_avg', 'humidity_avg']].values
y = after_transform['pedestrian_count'].values

#splitting data into training and testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#fitting a GAM with spline for each feature
gam = LinearGAM(s(0) + s(1) + s(2)).fit(x_train,y_train)
In [96]:
# Plotting the partial dependence for each feature
titles = ['Temperature', 'Light', 'Humidity']
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
for i, ax in enumerate(axs):
    XX = gam.generate_X_grid(term=i)
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX))
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
    ax.set_title(titles[i])
plt.tight_layout()
plt.show()
No description has been provided for this image

For temperature, i notice that while our model identifies the general trend, there are points, particularly where temperature sees a sharp decrease, where our predictions diverge from the actual data. This suggests that there may be additional factors influencing temperature changes that our current model does not capture.

The light condition model shows a closer alignment between predicted and actual values, reflecting the model's ability to understand the light data's patterns reasonably well. However, there are still areas where the model fails to capture the precise peaks and troughs, which may be critical depending on how light levels affect pedestrian movement in our area of interest.

The humidity model displays the best fit among the three, with predictions closely following the observed data's fluctuations. It indicates that our model can reliably represent how humidity varies, which can be useful for predicting pedestrian traffic under different humidity conditions.

In [97]:
# Predict and calculate R^2 or other metrics
from sklearn.metrics import r2_score
y_pred = gam.predict(x_test)

y_pred
Out[97]:
array([ 0.04027138,  0.06700212, -0.66478474, -0.57428818,  0.3253433 ,
        0.15498348, -0.10962844,  0.35302462,  0.77912352, -0.02207699,
       -0.16657702,  0.00795595,  0.07327775])
In [98]:
r2 = r2_score(y_test, y_pred)
print(f'R^2 Score: {r2}')
R^2 Score: 0.015264458926790403
In [ ]:
 

using the polynomial regression¶

In [99]:
# Define the variables
variables = ['temp_avg', 'light_avg', 'humidity_avg']
# Function to plot each variable
def plot_results(X_train, X_test, y_train, y_test, model, var_name, ax):
    # Training the model
    model.fit(X_train, y_train)
    # Predictions for plotting
    x_range = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)
    y_range = model.predict(x_range)

    # Plotting on the provided axes
    ax.scatter(X_train, y_train, color='blue', label='Training data')
    ax.scatter(X_test, y_test, color='red', label='Test data')
    ax.plot(x_range, y_range, color='green', label='Polynomial fit')
    ax.set_title(f'Polynomial Regression for {var_name}')
    ax.set_xlabel(var_name)
    ax.set_ylabel('Pedestrian Count')
    ax.legend()

# Setting up the figure and axes for subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 6))  # Adjust as necessary
fig.suptitle('Polynomial Regression Analysis')

# Create and plot models for each variable
for index, var in enumerate(variables):
    X_train = after_transform[[var]]
    X_test = after_transform[[var]]
    y_train = after_transform['pedestrian_count']
    y_test = after_transform['pedestrian_count']
    
    model = make_pipeline(PolynomialFeatures(2), LinearRegression())
    plot_results(X_train, X_test, y_train, y_test, model, var, axes[index])

plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Adjust layout to make room for the main title
plt.show()
No description has been provided for this image
In [ ]:
 

using lasso regularization L1¶

In [100]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

# Split the data into training and testing sets
train_set, test_set = train_test_split(after_transform, test_size=0.2, random_state=42)

# Function to create and evaluate a model with L1 regularization
def evaluate_model(X_train, X_test, y_train, y_test, degree=3, alpha=0.1):
    model = make_pipeline(PolynomialFeatures(degree), Lasso(alpha=alpha))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, r2

# Environmental conditions
variables = ['temp_avg', 'light_avg', 'humidity_avg']

# Results dictionary to store results
results = {}

# Evaluate the model for each variable
for var in variables:
    X_train = train_set[[var]]
    X_test = test_set[[var]]
    y_train = train_set['pedestrian_count']
    y_test = test_set['pedestrian_count']
    
    mse, r2 = evaluate_model(X_train, X_test, y_train, y_test, degree=2, alpha=0.1)
    results[var] = {'MSE': mse, 'R2': r2}

# Display results
for result in results:
    print(f"Results for {result}:")
    print(f"  MSE: {results[result]['MSE']}")
    print(f"  R^2 Score: {results[result]['R2']}\n")
Results for temp_avg:
  MSE: 0.9910413996653803
  R^2 Score: -0.06147089783893023

Results for light_avg:
  MSE: 0.9939604516573769
  R^2 Score: -0.06459739562179712

Results for humidity_avg:
  MSE: 0.9436623929454342
  R^2 Score: -0.010724847453226216

In [101]:
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

# Assume data is loaded into after_transform
train_set, test_set = train_test_split(after_transform, test_size=0.2, random_state=42)

def evaluate_model(X_train, X_test, y_train, y_test, degree=3, alpha=0.1):
    model = make_pipeline(PolynomialFeatures(degree), Lasso(alpha=alpha))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mse, r2, y_pred

variables = ['temp_avg', 'light_avg', 'humidity_avg']
results = {}

# Set up plotting
fig, axes = plt.subplots(1, len(variables), figsize=(18, 6))

# Evaluate the model for each variable and plot results
for idx, var in enumerate(variables):
    X_train = train_set[[var]]
    X_test = test_set[[var]]
    y_train = train_set['pedestrian_count']
    y_test = test_set['pedestrian_count']
    
    mse, r2, y_pred = evaluate_model(X_train, X_test, y_train, y_test, degree=2, alpha=0.1)
    results[var] = {'MSE': mse, 'R2': r2}
    
    ax = axes[idx]
    ax.scatter(y_test, y_pred, alpha=0.5)
    ax.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
    ax.set_title(f'Actual vs Predicted ({var})')
    ax.set_xlabel('Actual Count')
    ax.set_ylabel('Predicted Count')
    ax.label_outer()

plt.tight_layout()
plt.show()

# Print results
for result in results:
    print(f"Results for {result}:")
    print(f"  MSE: {results[result]['MSE']}")
    print(f"  R^2 Score: {results[result]['R2']}\n")
No description has been provided for this image
Results for temp_avg:
  MSE: 0.9910413996653803
  R^2 Score: -0.06147089783893023

Results for light_avg:
  MSE: 0.9939604516573769
  R^2 Score: -0.06459739562179712

Results for humidity_avg:
  MSE: 0.9436623929454342
  R^2 Score: -0.010724847453226216

In [ ]:
 

Limited Predictive Power: None of the individual models provide strong predictive performance, as indicated by the low R^2 scores. This suggests that these factors do not have a strong individual impact on pedestrian counts or that their effects are nonlinear beyond what can be captured with a simple second-degree polynomial. This performance suggests that these variables alone are not strong predictors of pedestrian counts

In [ ]:
 

identify clusters of high pedestrian activity and assess the proximity of bus stops to these clusters. This can help determine whether bus stops are strategically located to serve areas with high pedestrian demand¶

In [102]:
new_bstops.head()
Out[102]:
location str_id roadseg_id latitude longitude
0 -37.80384165792465, 144.93239283833262 1235255 21673 -37.803842 144.932393
1 -37.81548699581418, 144.9581794249902 1231226 20184 -37.815487 144.958179
2 -37.81353897396532, 144.95728334230756 1237092 20186 -37.813539 144.957283
3 -37.82191394843844, 144.95539345270072 1232777 22174 -37.821914 144.955394
4 -37.83316401267591, 144.97443745130263 1271914 22708 -37.833164 144.974437
In [103]:
new_bstops.shape
Out[103]:
(309, 5)
In [104]:
new_df.head()
Out[104]:
date timestamp location latitude longitude pedestrian_count
0 2023-03-31 2023-03-31 18:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 751
1 2023-04-01 2023-04-01 01:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 1592
2 2023-04-02 2023-04-02 00:00:00+00:00 -37.81349441, 144.96515323 -37.813494 144.965153 1651
3 2023-04-03 2023-04-03 19:00:00+00:00 -37.81349441, 144.96515323 -37.813494 144.965153 1681
4 2023-04-04 2023-04-04 14:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 1665
In [105]:
##trying to inplement a dynamic date to the bustops datasets to be able to merge it with pedestrian count datasets
start_date = '2023-03-31'  # Set your starting date here
new_bstops['date'] = pd.date_range(start=start_date, periods=len(new_bstops), freq='D')
In [106]:
new_bstops.head()
Out[106]:
location str_id roadseg_id latitude longitude date
0 -37.80384165792465, 144.93239283833262 1235255 21673 -37.803842 144.932393 2023-03-31
1 -37.81548699581418, 144.9581794249902 1231226 20184 -37.815487 144.958179 2023-04-01
2 -37.81353897396532, 144.95728334230756 1237092 20186 -37.813539 144.957283 2023-04-02
3 -37.82191394843844, 144.95539345270072 1232777 22174 -37.821914 144.955394 2023-04-03
4 -37.83316401267591, 144.97443745130263 1271914 22708 -37.833164 144.974437 2023-04-04
In [107]:
pbus_df = pd.merge(new_df, new_bstops, on=[ 'date'], how='inner')
pbus_df.head()
Out[107]:
date timestamp location_x latitude_x longitude_x pedestrian_count location_y str_id roadseg_id latitude_y longitude_y
0 2023-03-31 2023-03-31 18:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 751 -37.80384165792465, 144.93239283833262 1235255 21673 -37.803842 144.932393
1 2023-04-01 2023-04-01 01:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 1592 -37.81548699581418, 144.9581794249902 1231226 20184 -37.815487 144.958179
2 2023-04-02 2023-04-02 00:00:00+00:00 -37.81349441, 144.96515323 -37.813494 144.965153 1651 -37.81353897396532, 144.95728334230756 1237092 20186 -37.813539 144.957283
3 2023-04-03 2023-04-03 19:00:00+00:00 -37.81349441, 144.96515323 -37.813494 144.965153 1681 -37.82191394843844, 144.95539345270072 1232777 22174 -37.821914 144.955394
4 2023-04-04 2023-04-04 14:00:00+00:00 -37.81569416, 144.9668064 -37.815694 144.966806 1665 -37.83316401267591, 144.97443745130263 1271914 22708 -37.833164 144.974437
In [ ]:
 
In [108]:
#i want to check for the percentage of the missing alue in each column
missing_per = pbus_df.isna().sum().sort_values()/len(pbus_df)*100
missing_per
Out[108]:
date                0.0
timestamp           0.0
location_x          0.0
latitude_x          0.0
longitude_x         0.0
pedestrian_count    0.0
location_y          0.0
str_id              0.0
roadseg_id          0.0
latitude_y          0.0
longitude_y         0.0
dtype: float64
In [109]:
# Extract necessary columns and convert data types if necessary
pedestrian_data = pbus_df[['latitude_x', 'longitude_x', 'pedestrian_count']]
bus_stop_data = pbus_df[['latitude_y', 'longitude_y']]

# You may need to aggregate pedestrian counts by location if there are multiple measurements for the same location
pedestrian_data = pedestrian_data.groupby(['latitude_x', 'longitude_x']).sum().reset_index()
In [110]:
pedestrian_data.head()
Out[110]:
latitude_x longitude_x pedestrian_count
0 -37.821299 144.968793 8625
1 -37.820178 144.965089 5273
2 -37.820112 144.962919 5263
3 -37.820091 144.957587 1984
4 -37.819830 144.951025 5351

Initialize the MinMaxScaler¶

In [111]:
#scaler = StandardScaler()
#df_scaled = scaler.fit_transform(pedestrian_data[['latitude_x','longitude_x','pedestrian_count']])
#df_scaled = pd.DataFrame(df_scaled, columns=['latitude'])
#df_scaled = pd.DataFrame(df_scaled,columns=['latitude_x', 'longitude_x','pedestrian_count'])
#df_scaled.head()

#min_max_scaler = MinMaxScaler()
#df_scaled = min_max_scaler.fit_transform(pedestrian_data[['latitude_x', 'longitude_x','pedestrian_count']])
#df_scaled = pd.DataFrame(df_min_max_scaled, columns=['latitude_x', 'longitude_x','pedestrian_count'])


# Initialize the MinMaxScaler
min_max_scaler = MinMaxScaler()

# Assuming pedestrian_data is your DataFrame and has the appropriate columns
df_scaled = min_max_scaler.fit_transform(pedestrian_data[['latitude_x', 'longitude_x', 'pedestrian_count']])

# Convert the scaled array back into a DataFrame with the correct variable
df_scaled = pd.DataFrame(df_scaled, columns=['latitude_x', 'longitude_x', 'pedestrian_count'])
In [112]:
df_scaled.hist()
Out[112]:
array([[<Axes: title={'center': 'latitude_x'}>,
        <Axes: title={'center': 'longitude_x'}>],
       [<Axes: title={'center': 'pedestrian_count'}>, <Axes: >]],
      dtype=object)
No description has been provided for this image
In [ ]:
 

using silhouette scores to find the optimal k¶

In [113]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


# List to store silhouette scores for each k
k_values = range(2,10)
silh_scores = []

# Iterate over each value of k
for k in k_values:
    # Fit KMeans clustering model
    kmeans = KMeans(n_clusters=k, n_init=10) # n_init is just to remove the warning message displaying with the output
    cluster_labels = kmeans.fit_predict(df_scaled)
    
    # Calculate silhouette score
    silhouette_avg = silhouette_score(df_scaled, cluster_labels)
    silh_scores.append(silhouette_avg)
    
# Find the optimal k value
optimal_k = k_values[np.argmax(silh_scores)]
print("Optimal number of clusters (k):", optimal_k)

# Plot silhouette scores
plt.plot(k_values, silh_scores, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Different Values of k')
plt.show()
Optimal number of clusters (k): 5
No description has been provided for this image

using elbow method to find the optimal k¶

In [114]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

from yellowbrick.cluster import KElbowVisualizer

model = KMeans(n_init=10)
visualizer = KElbowVisualizer(
    model,  k=(2,12), metric='distortion', timings=False
) #distortion same as Euclidean distance

visualizer.fit(df_scaled) 
Out[114]:
KElbowVisualizer(ax=<Axes: >, estimator=KMeans(n_clusters=11, n_init=10),
                 k=(2, 12), timings=False)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KElbowVisualizer(ax=<Axes: >, estimator=KMeans(n_clusters=11, n_init=10),
                 k=(2, 12), timings=False)
KMeans(n_clusters=11, n_init=10)
KMeans(n_clusters=11, n_init=10)
No description has been provided for this image

using kmean clustering¶

In [115]:
k = 5 # Example cluster number

#Initialize and fit the K-means model
kmeans = KMeans(n_clusters=k, random_state=0).fit(df_scaled[['latitude_x', 'longitude_x']])

# Assign the clusters back to the original data without scaling in other to maintain the origina lat and long co-ordinate
pedestrian_data['cluster'] = kmeans.labels_
centroids_scaled = kmeans.cluster_centers_
In [116]:
centroids_scaled
Out[116]:
array([[0.14311346, 0.42272218],
       [0.37047605, 0.76357048],
       [0.99603709, 0.00698777],
       [0.15690566, 0.84640875],
       [0.69269993, 0.62984482]])
In [117]:
#centroids = scaler.inverse_transform(centroids_scaled)
#centroids
In [118]:
pedestrian_data.head()
Out[118]:
latitude_x longitude_x pedestrian_count cluster
0 -37.821299 144.968793 8625 3
1 -37.820178 144.965089 5273 3
2 -37.820112 144.962919 5263 3
3 -37.820091 144.957587 1984 0
4 -37.819830 144.951025 5351 0
In [119]:
#evaluating
#score = silhouette_score(df_scaled, kmeans.labels_)

#print("Silhouette Score evaluation for eblow: ", score)

The result of my silhouette score above shows a good clustering

trying with Kmean++ to compare¶

In [120]:
kmean_pp = KMeans(n_clusters=5, n_init="auto", init='k-means++')
y_pred_pp= kmean_pp.fit_predict(df_scaled)
y_pred_pp
Out[120]:
array([0, 0, 0, 2, 2, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       4, 4])
In [121]:
kmean_pp_ev = silhouette_score(df_scaled, kmean_pp.labels_)
print(kmean_pp_ev)
0.4888518899612205

Kmean++ is having the same evaluation with KMean

In [122]:
k = 5 #cluster number

# Initialize and fit the K-means model
kmeans_sil = KMeans(n_clusters=k, random_state=0).fit(df_scaled[['latitude_x', 'longitude_x']])

# Assign clusters back to the original data
df_scaled['cluster'] = kmeans_sil.labels_

Evaluating my clustering using silhouette score¶

In [123]:
#Evaluation for sillhotte_score
ade = silhouette_score(df_scaled, kmeans_sil.labels_)

print("Silhouette Score evaluation for silhouette_score: ", ade)
Silhouette Score evaluation for silhouette_score:  0.8350167250565035
In [ ]:
 

getting the co-ordinates of clusters center¶

In [124]:
# Get the coordinates of the cluster centers
centroids = kmeans.cluster_centers_
print("Cluster centers (latitude, longitude):")
print(centroids)

#centroids = scaler.inverse_transform(centroids)
#print(centriods)
Cluster centers (latitude, longitude):
[[0.14311346 0.42272218]
 [0.37047605 0.76357048]
 [0.99603709 0.00698777]
 [0.15690566 0.84640875]
 [0.69269993 0.62984482]]
In [ ]:
 
In [ ]:
 
In [125]:
bus_stop_data.head()
Out[125]:
latitude_y longitude_y
0 -37.803842 144.932393
1 -37.815487 144.958179
2 -37.813539 144.957283
3 -37.821914 144.955394
4 -37.833164 144.974437

Calculate the distance from each bus stop to each cluster centroid¶

In [126]:
from scipy.spatial.distance import cdist

# Calculate the distance from each bus stop to each cluster centroid
distances = cdist(bus_stop_data[['latitude_y', 'longitude_y']], centroids, metric='euclidean')

# Find the nearest cluster for each bus stop
nearest_cluster = np.argmin(distances, axis=1)
bus_stop_data['nearest_cluster'] = nearest_cluster
bus_stop_data['distance_to_nearest_cluster'] = np.min(distances, axis=1)
In [127]:
bus_stop_data.head()
Out[127]:
latitude_y longitude_y nearest_cluster distance_to_nearest_cluster
0 -37.803842 144.932393 3 149.002648
1 -37.815487 144.958179 3 149.030551
2 -37.813539 144.957283 3 149.029188
3 -37.821914 144.955394 3 149.029494
4 -37.833164 144.974437 3 149.050777
In [ ]:
 
In [ ]:
 
In [128]:
# Summary statistics or further analysis
print(bus_stop_data.describe())
       latitude_y  longitude_y  nearest_cluster  distance_to_nearest_cluster
count  268.000000   268.000000            268.0                   268.000000
mean   -37.810140   144.953397              3.0                   149.024564
std      0.015196     0.019662              0.0                     0.020332
min    -37.850563   144.900324              3.0                   148.978346
25%    -37.821857   144.946211              3.0                   149.014375
50%    -37.808027   144.957858              3.0                   149.027687
75%    -37.798180   144.967256              3.0                   149.037041
max    -37.776973   144.987731              3.0                   149.064477

plotting the map to show the clustering in other to make some recommendataion¶

In [129]:
import folium

# Create a map centered around an average coordinate
map_center_latitude = pedestrian_data['latitude_x'].mean()
map_center_longitude = pedestrian_data['longitude_x'].mean()
m = folium.Map(location=[map_center_latitude, map_center_longitude], zoom_start=13)

# Plot each pedestrian data point
for idx, row in pedestrian_data.iterrows():
    folium.CircleMarker(
        location=[row['latitude_x'], row['longitude_x']],
        radius=5,
        color='blue',
        fill=True,
        fill_color='blue',
        popup=f'Ped Count: {row["pedestrian_count"]}, Cluster: {row["cluster"]}'
    ).add_to(m)

# Plot bus stops
for idx, row in bus_stop_data.iterrows():
    folium.Marker(
        location=[row['latitude_y'], row['longitude_y']],
        icon=folium.Icon(color='red', icon='info-sign'),
        popup=f'Bus Stop near Cluster: {row["nearest_cluster"]}'
    ).add_to(m)

# Plot centroids
for centroid in centroids:
    folium.Marker(
        location=centroid,
        icon=folium.Icon(color='green', icon='star'),
        popup='Cluster Center'
    ).add_to(m)

# Display the map
m
Out[129]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Strategic Placement: The data and visualization suggests that almost 50%-60% bus stops are strategically placed near high pedestrian activity areas, potentially increasing bus ridership and convenience for pedestrians. This strategic placement can lead to more efficient public transportation systems and higher satisfaction among commuters.

Areas for Improvement: We noticed that a few stops are quite far from where most people are walking or gathering which means there are high distances to the nearest pedestrian clusters. which suggests that there might be room for improvement in bus stop placement.

Potential Solutions: "To address this, we need to think about possibly moving some bus stops or adding new ones where they're needed most. Another approach could be improving the walkways and paths leading to bus stops, making them safer and more inviting

In [ ]:
 
In [ ]:
 

conclusion¶

In this project, the following achievements were realized:¶
  • Actionable Recommendations:

Provided data-driven recommendations to optimize bus stop placement and improve accessibility for pedestrians. Offered insights that will support continuous improvement in urban mobility and infrastructure planning.

  • Refinement of Data Segmentation Techniques:

Enhanced the K-means clustering algorithm to improve the segmentation of pedestrian data. Achieved more accurate identification of high pedestrian activity clusters, leading to strategic recommendations for the placement of pedestrian-related services such as bus stops.

  • Detailed Analysis of Pedestrian Traffic Patterns:

Provided insights into how these factors influence foot traffic, offering valuable information for urban planning and resource allocation

we gained several key insights and learning outcomes:¶
  • Advanced Data Segmentation:

Clustering Techniques: Enhanced understanding of clustering methodologies, particularly K-means, and their application in segmenting pedestrian data. Identifying Patterns: Developed the ability to interpret cluster patterns effectively, informing actionable recommendations for urban planningg:

  • Strategic Infrastructure Planning:

Optimal Placement: Gained insights into how data-driven analysis can inform strategic placement of public transit infrastructure, like bus stops, to maximize accessibilit

  • Advanced visualization techniques with matplotlib and geopandas for geospatial data anal

  • Impact of Environmental Factors on Pedestrian Traffic:

Understanding Variables: Learned how environmental factors such as temperature and light levels significantly influence pedestrian movement patterns

At a broad level¶

Peak Pedestrian Days: Monday, Wednesday, and Saturday were identified as the days with the highest pedestrian traffic. These days exhibit significantly higher pedestrian volumes compared to other days of the week, indicating a pattern of increased activity that could be attributed to specific weekly events, market days, or leisure activities typical of these days.

Lowest Pedestrian Traffic: Friday was observed to have the lowest pedestrian count among all days of the week. This reduction in pedestrian activity could reflect a weekly variation in social or commercial patterns, such as alternative entertainment options, travel patterns, or shoppi

The data and visualization suggests that almost 50%-60% bus stops are strategically placed near high pedestrian activity areas, potentially increasing bus ridership and convenience for pedestrians. This strategic placement can lead to more efficient public transportation systems and higher satisfaction among commuters.ngysis.on

references¶

[2] Victorian 'Crash-Stat's dataset https://discover.data.vic.gov.au/dataset/crash-stats-data-extract/resource/392b88c0-f010-491f-ac92-531c293de2e9

[3] Pedestrian Routes Dataset https://data.melbourne.vic.gov.au/Transport/Pedestrian-Network/4id4-tydi

Technical References

[4] Accessing geoJSON data https://stackoverflow.com/questions/48263802/finding-location-using-geojson-file-using-python

[5] Accessing geoJSON data https://medium.com/analytics-vidhya/measure-driving-distance-time-and-plot-routes-between-two-geographical-locations-using-python-39995dfea7e

[6] Visualising a geoJSON dataset https://python-visualization.github.io/folium/quickstart.html#GeoJSON/TopoJSON-Overlays

[7] Visualising categorised data on a map https://www.geeksforgeeks.org/python-adding-markers-to-volcano-locations-using-folium-package/

[8] Creating point plot group layers with folium https://towardsdatascience.com/creating-an-interactive-map-of-wildfire-data-using-folium-in-pythoiveTimeSeries.html